Abstract

Our recent work on large-scale learning using b-bit minwise hashing [21, 22] was tested on the webspam dataset
(about 24 GB in LibSVM format), which may be far too small compared with real datasets used in industry. Since we
could not access the proprietary dataset used in [31] for testing the Vowpal Wabbit (VW) hashing algorithm, in this
paper we present an experimental study based on the expanded rcv1 dataset (about 200 GB in LibSVM format).

In our earlier report [22], the experiments demonstrated that, with merely 200 hashed values per data point, b-bit
minwise hashing can achieve test accuracies similar to those of VW with $10^6$ hashed values per data point, on the webspam
dataset. In this paper, our new experiments on the (expanded) rcv1 dataset clearly agree with our earlier observation
that the b-bit minwise hashing algorithm is substantially more accurate than the VW hashing algorithm at the same storage.
For example, with $2^{14}$ (16384) hashed values per data point, VW achieves test accuracies similar to those of b-bit hashing with
merely 30 hashed values per data point. This is of course not surprising, as the report [22] has already demonstrated
that the variance of the VW algorithm can be order(s) of magnitude larger than the variance of b-bit minwise hashing.
It was shown in [22] that VW has the same variance as random projections.

At least in the context of search, minwise hashing has been widely used in industry. It is well-understood that 
the preprocessing cost is not a major issue because the preprocessing is trivially parallelizable and can be conducted 
off-line or combined with the data-collection process. Nevertheless, in this paper, we report that, even merely from 
the perspective of academic machine learning practice, the preprocessing cost is not a major issue for the following 
reasons: 

• The preprocessing incurs only a one-time cost. The same processed data can be used for many training exper- 
iments, for example, for many different "C" values in SVM cross-validation, or for different combinations of 
data splitting (into training and testing sets). 

• For training truly large-scale datasets, the dominating cost is often the data loading time. In our 200 GB dataset
(which may still be very small by industry standards), the preprocessing cost of b-bit minwise
hashing is on the same order of magnitude as the data loading time.

• Using a GPU, the preprocessing cost can be easily reduced to a small fraction (e.g., < 1/7) of the data loading
time.

The standard industry practice of minwise hashing is to use universal hashing to replace permutations. In other 
words, there is no need to store any permutation mappings, one of the reasons why minwise hashing is popular. In this 
paper, we also provide experiments to verify this practice, based on the simplest 2-universal hashing, and illustrate
that the performance of b-bit minwise hashing does not degrade.



1 Introduction 

Many machine learning applications are faced with large and inherently high-dimensional datasets. For example, [29]
discusses training datasets with (on average) $10^{11}$ items and $10^9$ distinct features. [31] experimented with a dataset of
potentially 16 trillion ($1.6 \times 10^{13}$) unique features. Interestingly, while large-scale learning has become a very urgent,
hot topic, it is usually very difficult for researchers from universities to obtain truly large, high-dimensional datasets
from industry. For example, the experiments in our recent work [21, 22] on large-scale learning using b-bit minwise
hashing [23, 24, 20] were based on the webspam dataset (about 24 GB in LibSVM format), which may be too small.

To overcome this difficulty, we have generated a dataset of about 200 GB (in LibSVM format) from the rcv1
dataset, using the original features + all pairwise combinations of features + 1/30 of the 3-way combinations of
features. We choose 200 GB (which of course is still very small) because relatively inexpensive workstations with 192
GB memory are on the market, which may make it possible for LIBLINEAR [11, 15], the popular solver for logistic
regression and linear SVM, to perform the training of the entire dataset in main memory. We hope in the near future
we will be able to purchase such a workstation. Of course, in this "information explosion" age, the growth of data is
always much faster than the growth of memory capacity.

Note that our hashing method is orthogonal to particular solvers of logistic regression and SVM. We have
tested b-bit minwise hashing with other solvers [17, 27, 3] and observed substantial improvements. We choose
LIBLINEAR [11] as the workhorse because it is a popular tool and may be familiar to non-experts. Our experiments may
be easily validated by simply generating the hashed data off-line and feeding them to LIBLINEAR (or other solvers)
without modification to the code. Also, we notice that the source code of LIBLINEAR, unlike many other excellent
solvers, can be compiled in Visual Studio without modification. As many practitioners are using Windows (see footnote 1), we
use LIBLINEAR throughout the paper, for the sake of maximizing the repeatability of our work.

Unsurprisingly, our experimental results agree with our prior studies [22] that b-bit minwise hashing is substantially
more accurate than the Vowpal Wabbit (VW) hashing algorithm [31] at the same storage. Note that in our paper,
VW refers to the particular hashing algorithm in [31], not the online learning platform that the authors of [31, 28] have
been developing. For evaluation purposes, we must separate out hashing algorithms from learning algorithms because
they are orthogonal to each other.

All randomized algorithms including minwise hashing and random projections rely on pseudo-random numbers. 
A common practice of minwise hashing (e.g., [4]) is to use universal hashing functions to replace perfect random
permutations. In this paper, we also present an empirical study to verify that this common practice does not degrade 
the learning performance. 

Minwise hashing has been widely deployed in industry, and b-bit minwise hashing requires only minimal modifications.
It is well-understood, at least in the context of search, that the (one-time) preprocessing cost is not a major
issue because the preprocessing step, which is trivially parallelizable, can be conducted off-line or combined with the
data-collection process. In the context of pure machine learning research, one thing we notice is that for training truly
large-scale datasets, the data loading time is often dominating [32], for online algorithms as well as batch algorithms
(if the data fit in memory). Thus, if we have to load the data many times, for example, for testing different "C" values
in SVM or running an online algorithm for multiple epochs, then the benefits of data reduction algorithms such as
b-bit minwise hashing would be enormous.

Even on our dataset of only 200 GB, we observe that the preprocessing cost is roughly on the same order of magnitude
as the data loading time. Furthermore, using a GPU (which is inexpensive) for fast hashing, we can reduce the
preprocessing cost of b-bit minwise hashing to a small fraction of the data loading time. In other words, the dominating
cost is still the data loading time.

We are currently experimenting with b-bit minwise hashing for machine learning on datasets much larger than a terabyte, and the results
will be reported in subsequent technical reports. It is a very enjoyable process to experiment with b-bit minwise hashing, and
we certainly would like to share our experience with the machine learning and data mining community.

2 Review Minwise Hashing and b-Bit Minwise Hashing 

Minwise hashing [4, 5] has been successfully applied to a very wide range of real-world problems, especially in the
context of search (see, e.g., [30, 10, 26]), for efficiently computing set similarities.

Footnote 1 (to the remark on Windows in Section 1): Note that the current version of Cygwin has a very serious memory
limitation and hence is not suitable for large-scale experiments, even though all popular solvers can be compiled under Cygwin.

Minwise hashing mainly works well with binary data, which can be viewed either as 0/1 vectors or as sets. Given
two sets, $S_1, S_2 \subseteq \Omega = \{0, 1, 2, \ldots, D-1\}$, a widely used (normalized) measure of similarity is the resemblance $R$:

$$R = \frac{|S_1 \cap S_2|}{|S_1 \cup S_2|} = \frac{a}{f_1 + f_2 - a}, \quad \text{where } f_1 = |S_1|, \ f_2 = |S_2|, \ a = |S_1 \cap S_2|.$$

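In this method, one applies a random permutation $\pi: \Omega \rightarrow \Omega$ on $S_1$ and $S_2$. The collision probability is simply

$$\mathbf{Pr}\left(\min(\pi(S_1)) = \min(\pi(S_2))\right) = \frac{|S_1 \cap S_2|}{|S_1 \cup S_2|} = R.$$

One can repeat the permutation $k$ times: $\pi_1, \pi_2, \ldots, \pi_k$ to estimate $R$ without bias, as

$$\hat{R}_M = \frac{1}{k} \sum_{j=1}^{k} 1\{\min(\pi_j(S_1)) = \min(\pi_j(S_2))\}, \qquad (1)$$

$$\mathrm{Var}\left(\hat{R}_M\right) = \frac{1}{k} R(1 - R). \qquad (2)$$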
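As an illustration of the estimator (1), the following is a minimal Python sketch, not the implementation used in our experiments. The function names, the toy sets, and the choices D = 200 and k = 500 are assumptions made only for this example; explicit permutations are feasible here only because D is tiny.

```python
import random


def resemblance(s1, s2):
    """Exact resemblance R = |S1 ∩ S2| / |S1 ∪ S2|."""
    return len(s1 & s2) / len(s1 | s2)


def minwise_estimate(s1, s2, D, k, seed=0):
    """Estimate R with k independent random permutations of {0, ..., D-1}."""
    rng = random.Random(seed)
    matches = 0
    for _ in range(k):
        perm = list(range(D))
        rng.shuffle(perm)              # perm[t] plays the role of pi_j(t)
        z1 = min(perm[t] for t in s1)  # min(pi_j(S1))
        z2 = min(perm[t] for t in s2)  # min(pi_j(S2))
        matches += (z1 == z2)          # indicator 1{z1 = z2}
    return matches / k


# Toy sets (hypothetical): |S1 ∩ S2| = 30, |S1 ∪ S2| = 90.
S1 = set(range(0, 60))
S2 = set(range(30, 90))
R = resemblance(S1, S2)
R_hat = minwise_estimate(S1, S2, D=200, k=500)
```

By (2), the standard deviation of the estimate in this toy setting is about $\sqrt{R(1-R)/k} \approx 0.02$, so `R_hat` should typically fall close to `R`.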



The common practice of minwise hashing is to store each hashed value, e.g., $\min(\pi(S_1))$ and $\min(\pi(S_2))$, using
64 bits [12]. The storage (and computational) cost will be prohibitive in truly large-scale (industry) applications [25].

In order to apply minwise hashing for efficiently training linear learning algorithms such as logistic regression or
linear SVM, we need to express the estimator (1) as an inner product. For simplicity, we introduce

$$z_1 = \min(\pi_j(S_1)), \quad z_2 = \min(\pi_j(S_2)),$$

and we hope that the term $1\{z_1 = z_2\}$ can be expressed as an inner product. Indeed, because

$$1\{z_1 = z_2\} = \sum_{t=0}^{D-1} 1\{z_1 = t\} \times 1\{z_2 = t\},$$

we know immediately that the estimator (1) for minwise hashing is an inner product between two extremely high-dimensional
($D \times k$) vectors. Each vector, which has exactly $k$ 1's, is a concatenation of $k$ $D$-dimensional vectors.
Because $D = 2^{64}$ is possible in industry applications, the total indexing space ($D \times k$) may be too high to directly use
this representation for training.



The recent development of b-bit minwise hashing [23, 24, 20] provides a strikingly simple solution by storing only
the lowest $b$ bits (instead of 64 bits) of each hashed value. For convenience, we define

$$e_{1,i} = i\text{th lowest bit of } z_1, \quad e_{2,i} = i\text{th lowest bit of } z_2.$$

Theorem 1 [23] Assume $D$ is large (i.e., $D \rightarrow \infty$).

$$P_b = \mathbf{Pr}\left(\prod_{i=1}^{b} 1\{e_{1,i} = e_{2,i}\} = 1\right) = C_{1,b} + (1 - C_{2,b}) R, \qquad (3)$$

where

$$r_1 = \frac{f_1}{D}, \quad r_2 = \frac{f_2}{D}, \quad f_1 = |S_1|, \quad f_2 = |S_2|,$$

$$C_{1,b} = A_{1,b} \frac{r_2}{r_1 + r_2} + A_{2,b} \frac{r_1}{r_1 + r_2}, \quad C_{2,b} = A_{1,b} \frac{r_1}{r_1 + r_2} + A_{2,b} \frac{r_2}{r_1 + r_2},$$

$$A_{1,b} = \frac{r_1 [1 - r_1]^{2^b - 1}}{1 - [1 - r_1]^{2^b}}, \quad A_{2,b} = \frac{r_2 [1 - r_2]^{2^b - 1}}{1 - [1 - r_2]^{2^b}}.$$

As $r_1 \rightarrow 0$ and $r_2 \rightarrow 0$, the limits are

$$A_{1,b} = A_{2,b} = C_{1,b} = C_{2,b} = \frac{1}{2^b}, \qquad (4)$$

$$P_b = \frac{1}{2^b} + \left(1 - \frac{1}{2^b}\right) R. \qquad (5)$$

The case $r_1, r_2 \rightarrow 0$ is very common in practice because the data are often relatively highly sparse (i.e., $r_1, r_2 \approx 0$),
although they can be very large in the absolute scale. For example, if $D = 2^{64}$, then a set $S_1$ with $f_1 = |S_1| = 2^{54}$
(which roughly corresponds to the size of a small novel) is highly sparse ($r_1 \approx 0.001$) even though $2^{54}$ is actually very
large in the absolute scale. One can also verify that the error by using (5) to replace (3) is bounded by $O(r_1 + r_2)$,
which is very small when $r_1, r_2 \rightarrow 0$. In fact, [24] extensively used this argument for studying 3-way set similarities.
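To make the gap between the exact collision probability (3) and its sparse-data limit (5) concrete, here is a small Python sketch under stated assumptions: the function names and the numerical inputs (R = 0.5, r1 = r2 = 1e-4, b = 2) are hypothetical and chosen only for illustration.

```python
def A(r, b):
    """A_{.,b} = r (1 - r)^(2^b - 1) / (1 - (1 - r)^(2^b)) from Theorem 1."""
    m = 2 ** b
    return r * (1.0 - r) ** (m - 1) / (1.0 - (1.0 - r) ** m)


def collision_prob(R, r1, r2, b):
    """Collision probability P_b from (3), using the full constants."""
    A1, A2 = A(r1, b), A(r2, b)
    C1 = A1 * r2 / (r1 + r2) + A2 * r1 / (r1 + r2)
    C2 = A1 * r1 / (r1 + r2) + A2 * r2 / (r1 + r2)
    return C1 + (1.0 - C2) * R


def collision_prob_sparse(R, b):
    """Sparse-data limit (5): P_b = 1/2^b + (1 - 1/2^b) R."""
    return 1.0 / 2 ** b + (1.0 - 1.0 / 2 ** b) * R


# Hypothetical inputs for illustration only.
p_exact = collision_prob(R=0.5, r1=1e-4, r2=1e-4, b=2)
p_sparse = collision_prob_sparse(R=0.5, b=2)
gap = abs(p_exact - p_sparse)   # small compared with r1 + r2
```

For these inputs the gap is far smaller than $r_1 + r_2$, which is consistent with the $O(r_1 + r_2)$ statement above.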

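We can then estimate $P_b$ (and $R$) from $k$ independent permutations: $\pi_1, \pi_2, \ldots, \pi_k$,

$$\hat{R}_b = \frac{\hat{P}_b - C_{1,b}}{1 - C_{2,b}}, \qquad \hat{P}_b = \frac{1}{k} \sum_{j=1}^{k} \left\{ \prod_{i=1}^{b} 1\{e_{1,i,\pi_j} = e_{2,i,\pi_j}\} \right\}, \qquad (6)$$

$$\mathrm{Var}\left(\hat{R}_b\right) = \frac{\mathrm{Var}\left(\hat{P}_b\right)}{[1 - C_{2,b}]^2} = \frac{1}{k} \frac{P_b (1 - P_b)}{[1 - C_{2,b}]^2}. \qquad (7)$$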

Clearly, the similarity ($R$) information is adequately encoded in $\hat{P}_b$. In other words, often there is no need to
explicitly estimate $R$. The estimator $\hat{P}_b$ is an inner product between two vectors in $2^b \times k$ dimensions with exactly
$k$ 1's. Therefore, if $b$ is not too large (e.g., $b \le 16$), this intuition provides a simple practical strategy for using b-bit
minwise hashing for large-scale learning.
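The estimator (6) can be sketched in a few lines of Python. This is only an illustration under assumptions: it uses the sparse-limit constants $C_{1,b} = C_{2,b} = 1/2^b$ from (4), the function name and toy sets are hypothetical, and instead of materializing a permutation of a huge universe it samples the images of the elements of $S_1 \cup S_2$ without replacement (which is how those elements are distributed under a uniformly random permutation).

```python
import random


def bbit_estimate(s1, s2, b, k, D=2 ** 62, seed=0):
    """Return (P_hat, R_hat) from the lowest b bits of k minwise hashes."""
    rng = random.Random(seed)
    union = sorted(s1 | s2)
    mask = (1 << b) - 1
    matches = 0
    for _ in range(k):
        # Images of the union under one random permutation of {0, ..., D-1}.
        images = dict(zip(union, rng.sample(range(D), len(union))))
        z1 = min(images[t] for t in s1)
        z2 = min(images[t] for t in s2)
        matches += ((z1 & mask) == (z2 & mask))  # all b lowest bits agree
    P_hat = matches / k
    C = 1.0 / 2 ** b               # sparse-limit C_{1,b} = C_{2,b}, assumption
    R_hat = (P_hat - C) / (1.0 - C)
    return P_hat, R_hat


# Toy sets (hypothetical) with true resemblance 1/3.
S1 = set(range(0, 60))
S2 = set(range(30, 90))
P_hat, R_hat = bbit_estimate(S1, S2, b=2, k=2000)
```

With $b = 2$ and $R = 1/3$, (5) gives $P_b = 0.5$, and (7) gives a standard deviation of roughly 0.015 for `R_hat` at $k = 2000$.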



3 Integrating b-Bit Minwise Hashing with (Linear) Learning Algorithms

Linear algorithms such as linear SVM and logistic regression have become very powerful and extremely popular.
Representative software packages include SVMperf [17], Pegasos [27], Bottou's SGD SVM [3], and LIBLINEAR [11].

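Given a dataset $\{(\mathbf{x}_i, y_i)\}_{i=1}^{n}$, $\mathbf{x}_i \in \mathbb{R}^D$, $y_i \in \{-1, 1\}$, the $L_2$-regularized linear SVM solves the following
optimization problem:

$$\min_{\mathbf{w}} \ \frac{1}{2} \mathbf{w}^T \mathbf{w} + C \sum_{i=1}^{n} \max\left\{1 - y_i \mathbf{w}^T \mathbf{x}_i, \ 0\right\}, \qquad (8)$$

and the $L_2$-regularized logistic regression solves a similar problem:

$$\min_{\mathbf{w}} \ \frac{1}{2} \mathbf{w}^T \mathbf{w} + C \sum_{i=1}^{n} \log\left(1 + e^{-y_i \mathbf{w}^T \mathbf{x}_i}\right). \qquad (9)$$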



Here $C > 0$ is an important penalty parameter. Since our purpose is to demonstrate the effectiveness of our proposed
scheme using b-bit hashing, we simply provide results for a wide range of $C$ values and assume that the best
performance is achievable if we conduct cross-validations.
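For readers who wish to check a solver's output, the two objectives (8) and (9) can be evaluated directly. The following Python sketch is only an illustration with dense lists; the function names and the toy data are assumptions, and it is not how LIBLINEAR evaluates or optimizes these objectives.

```python
import math


def dot(w, x):
    """Inner product of two equal-length dense vectors."""
    return sum(wi * xi for wi, xi in zip(w, x))


def svm_objective(w, X, y, C):
    """Objective (8): 0.5 w'w + C * sum of hinge losses."""
    hinge = sum(max(1.0 - yi * dot(w, xi), 0.0) for xi, yi in zip(X, y))
    return 0.5 * dot(w, w) + C * hinge


def logistic_objective(w, X, y, C):
    """Objective (9): 0.5 w'w + C * sum of logistic losses."""
    loss = sum(math.log1p(math.exp(-yi * dot(w, xi))) for xi, yi in zip(X, y))
    return 0.5 * dot(w, w) + C * loss


# Toy data (hypothetical): two binary feature vectors with labels +1 / -1.
X = [[1.0, 0.0, 1.0], [0.0, 1.0, 1.0]]
y = [1, -1]
w0 = [0.0, 0.0, 0.0]
svm_at_zero = svm_objective(w0, X, y, C=1.0)       # n * C at w = 0
lr_at_zero = logistic_objective(w0, X, y, C=1.0)   # n * C * log(2) at w = 0
```

At $\mathbf{w} = 0$ every hinge loss equals 1 and every logistic loss equals $\log 2$, which gives a quick sanity check for any value of $C$.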

In our approach, we apply $k$ independent random permutations on each feature vector $\mathbf{x}_i$ and store the lowest $b$
bits of each hashed value. This way, we obtain a new dataset which can be stored using merely $nbk$ bits. At run-time,
we expand each new data point into a $2^b \times k$-length vector.

For example, suppose $k = 3$ and the hashed values are originally {12013, 25964, 20191}, whose binary digits are
{010111011101101, 110010101101100, 100111011011111}. Consider $b = 2$. Then the binary digits are stored as
{01, 00, 11} (which corresponds to {1, 0, 3} in decimals). At run-time, we need to expand them into a vector of length
$2^b k = 12$, to be {0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0}, which will be the new feature vector fed to a solver:

Original hashed values (k = 3)     :  12013            25964            20191
Original binary representations   :  010111011101101  110010101101100  100111011011111
Lowest b = 2 binary digits         :  01               00               11
Expanded 2^b = 4 binary digits     :  0010             0001             1000
New feature vector fed to a solver :  {0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0}
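The expansion step in this example can be sketched in Python as follows. The function name is hypothetical, and the position convention (a b-bit value $v$ sets position $2^b - 1 - v$ within its block, counting from the left) is inferred from the worked example above rather than from any released code.

```python
def expand_bbit(hashed_values, b):
    """Expand k hashed values into a 0/1 vector of length 2^b * k."""
    width = 1 << b
    mask = width - 1
    vec = []
    for z in hashed_values:
        v = z & mask                 # keep only the lowest b bits
        block = [0] * width
        block[width - 1 - v] = 1     # one-hot position within this block
        vec.extend(block)
    return vec


features = expand_bbit([12013, 25964, 20191], b=2)
# features == [0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0]
```

In practice only the $k$ nonzero positions need to be written out (e.g., as index:value pairs in LibSVM format), since each block contains exactly one 1.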
4 Experimental Results of b-Bit Minwise Hashing on Expanded RCV1 Dataset



In our earlier technical reports [21, 22], our experimental settings closely followed the work in [32] by testing b-bit
minwise hashing on the webspam dataset ($n = 350000$, $D = 16609143$). Following [32], we randomly selected
20% of samples for testing and used the remaining 80% of samples for training. Since the webspam dataset (24 GB in
LibSVM format) may be too small compared with datasets used in industry, in this paper we present an empirical study
on the expanded rcv1 dataset by using the original features + all pairwise combinations (products) of features + 1/30
of 3-way combinations (products) of features.

We chose LIBLINEAR as the tool to demonstrate the effectiveness of our algorithm. All experiments were con- 
ducted on workstations with Xeon(R) CPU (W5590@3.33GHz) and 48GB RAM, under Windows 7 System. 



Table 1: Data information

Dataset            # Examples (n)   # Dimensions (D)   # Nonzeros Median (Mean)   Train / Test Split
Webspam (24 GB)    350000           16609143           3889 (3728)                80% / 20%
Rcv1 (200 GB)      677399           1010017424         3051 (12062)               50% / 50%



Note that we used the original "testing" data of rcv1 to generate our new expanded dataset. The original "training"
data of rcv1 had only 20242 examples. Also, to ensure reliable test results, we randomly split our expanded rcv1
dataset into two halves, for training and testing.



4.1 Experimental Results Using Linear SVM 

Since there is an important tuning parameter $C$ in linear SVM and logistic regression, we conducted our extensive
experiments for a wide range of $C$ values (from $10^{-3}$ to $10^2$) with finer spacings in [0.1, 10].

We mainly experimented with $k = 30$ to $k = 500$, and $b = 1, 2, 4, 8, 12$, and 16. Figures 1 and 2 provide the
test accuracies and train times, respectively. We hope in the near future we will add the baseline results by training
LIBLINEAR on the entire (200 GB) dataset, once we have the resources to do so. Note that with merely $k = 30$, we
can achieve > 90% test accuracies (using $b = 12$). The VW algorithm can also achieve 90% accuracies with $k = 2^{14}$.

For this dataset, the best performances were usually achieved when C > 1. Note that we plot all the results for 
different C values in one figure so that others can easily verify our work, to maximize the repeatability. 




4.2 Experimental Results Using Logistic Regression 

Figure 3 presents the test accuracy and Figure 4 presents the training time using logistic regression. Again, just like
our experiments with SVM, using merely $k = 30$ and $b = 12$, we can achieve > 90% test accuracies; and using
$k > 300$, we can achieve > 95% test accuracies.

Figure 4: Logistic regression training time on rcv1.



5 Comparisons with Vowpal Wabbit (VW) 



The two methods, random projections [1, 19] and Vowpal Wabbit (VW) [31, 28], are not limited to binary data (although
for ultra high-dimensional data used in the context of search, the data are often binary). The VW algorithm is also related
to the Count-Min sketch [9]. In this paper, we use "VW" particularly for the hashing algorithm in [31].

Since VW has the same variance as random projections (RP), we first provide a review for both RP and VW. 



5.1 Random Projections (RP) 

For convenience, we denote two $D$-dim data vectors by $u_1, u_2 \in \mathbb{R}^D$. Again, the task is to estimate the inner product
$a = \sum_{i=1}^{D} u_{1,i} u_{2,i}$.

The general idea is to multiply the data vectors, e.g., $u_1$ and $u_2$, by a random matrix $\{r_{ij}\} \in \mathbb{R}^{D \times k}$, where $r_{ij}$ is
sampled i.i.d. from the following generic distribution with [19]

$$E(r_{ij}) = 0, \quad \mathrm{Var}(r_{ij}) = 1, \quad E(r_{ij}^3) = 0, \quad E(r_{ij}^4) = s, \quad s \ge 1. \qquad (10)$$

We must have $s \ge 1$ because $\mathrm{Var}(r_{ij}^2) = E(r_{ij}^4) - E^2(r_{ij}^2) = s - 1 \ge 0$.

This generates two $k$-dim vectors, $v_1$ and $v_2$:

$$v_{1,j} = \sum_{i=1}^{D} u_{1,i} r_{ij}, \quad v_{2,j} = \sum_{i=1}^{D} u_{2,i} r_{ij}, \quad j = 1, 2, \ldots, k.$$

The general distributions which satisfy (10) include the standard normal distribution (in this case, $s = 3$) and the
"sparse projection" distribution specified as

$$r_{ij} = \sqrt{s} \times \begin{cases} 1 & \text{with prob. } \frac{1}{2s} \\ 0 & \text{with prob. } 1 - \frac{1}{s} \\ -1 & \text{with prob. } \frac{1}{2s} \end{cases} \qquad (11)$$

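[19] provided the following unbiased estimator $\hat{a}_{rp,s}$ of $a$ and the general variance formula:

$$\hat{a}_{rp,s} = \frac{1}{k} \sum_{j=1}^{k} v_{1,j} v_{2,j}, \qquad E(\hat{a}_{rp,s}) = a = \sum_{i=1}^{D} u_{1,i} u_{2,i}, \qquad (12)$$

$$\mathrm{Var}(\hat{a}_{rp,s}) = \frac{1}{k} \left[ \sum_{i=1}^{D} u_{1,i}^2 \sum_{i=1}^{D} u_{2,i}^2 + \left( \sum_{i=1}^{D} u_{1,i} u_{2,i} \right)^2 + (s - 3) \sum_{i=1}^{D} u_{1,i}^2 u_{2,i}^2 \right], \qquad (13)$$

which means $s = 1$ achieves the smallest variance. The only elementary distribution we know that satisfies (10) with
$s = 1$ is the two-point distribution in $\{-1, 1\}$ with equal probabilities, i.e., (11) with $s = 1$.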
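As a rough numerical check of (12) and (13) with $s = 1$, the following Python sketch draws the projection entries from the two-point distribution in $\{-1, 1\}$. The function names, the toy vectors, and the choice $k = 1000$ are assumptions for illustration only.

```python
import random


def rp_estimate(u1, u2, k, seed=0):
    """Estimator (12) with entries r_ij drawn uniformly from {-1, +1} (s = 1)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(k):
        r = [rng.choice((-1.0, 1.0)) for _ in u1]   # column j of the matrix
        v1 = sum(x * rj for x, rj in zip(u1, r))    # v_{1,j}
        v2 = sum(x * rj for x, rj in zip(u2, r))    # v_{2,j}
        total += v1 * v2
    return total / k


def rp_variance(u1, u2, k, s=1.0):
    """Variance formula (13)."""
    m1 = sum(x * x for x in u1)
    m2 = sum(x * x for x in u2)
    a = sum(x * y for x, y in zip(u1, u2))
    q = sum((x * y) ** 2 for x, y in zip(u1, u2))
    return (m1 * m2 + a * a + (s - 3.0) * q) / k


# Toy binary vectors (hypothetical), D = 100.
u1 = [1.0 if i % 2 == 0 else 0.0 for i in range(100)]
u2 = [1.0 if i % 3 == 0 else 0.0 for i in range(100)]
a_true = sum(x * y for x, y in zip(u1, u2))
a_hat = rp_estimate(u1, u2, k=1000)
var_theory = rp_variance(u1, u2, k=1000)
```

Note how the variance is driven by the product of the two squared norms (here 50 and 34) even though the inner product itself is only 17.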



5.2 Vowpal Wabbit (VW) 

Again, in this paper, "VW" always refers to the particular algorithm in [31]. VW may be viewed as a "bias-corrected"
version of the Count-Min (CM) sketch algorithm [9]. In the original CM algorithm, the key step is to independently
and uniformly hash elements of the data vectors to buckets $\in \{1, 2, 3, \ldots, k\}$, and the hashed value is the sum of the
elements in the bucket. That is, $h(i) = j$ with probability $\frac{1}{k}$, where $j \in \{1, 2, \ldots, k\}$. For convenience, we introduce
an indicator function:

$$I_{ij} = \begin{cases} 1 & \text{if } h(i) = j \\ 0 & \text{otherwise} \end{cases}$$

which allows us to write the hashed data as

$$w_{1,j} = \sum_{i=1}^{D} u_{1,i} I_{ij}, \qquad w_{2,j} = \sum_{i=1}^{D} u_{2,i} I_{ij}.$$

The estimate $\hat{a}_{cm} = \sum_{j=1}^{k} w_{1,j} w_{2,j}$ is (severely) biased for the task of estimating the inner products. [31] proposed
a creative method for bias-correction, which consists of pre-multiplying (element-wise) the original data vectors
with a random vector whose entries are sampled i.i.d. from the two-point distribution in $\{-1, 1\}$ with equal probabilities,
which corresponds to $s = 1$ in (11).

[22] considered a more general situation, for any $s \ge 1$. After applying multiplication and hashing on $u_1$ and $u_2$
as in [31], the resultant vectors $g_1$ and $g_2$ are

$$g_{1,j} = \sum_{i=1}^{D} u_{1,i} r_i I_{ij}, \qquad g_{2,j} = \sum_{i=1}^{D} u_{2,i} r_i I_{ij}, \qquad j = 1, 2, \ldots, k, \qquad (14)$$

where $r_i$ is defined as in (10), i.e., $E(r_i) = 0$, $E(r_i^2) = 1$, $E(r_i^3) = 0$, $E(r_i^4) = s$. [22] proved that

$$\hat{a}_{vw,s} = \sum_{j=1}^{k} g_{1,j} g_{2,j}, \qquad E(\hat{a}_{vw,s}) = \sum_{i=1}^{D} u_{1,i} u_{2,i} = a, \qquad (15)$$

$$\mathrm{Var}(\hat{a}_{vw,s}) = (s - 1) \sum_{i=1}^{D} u_{1,i}^2 u_{2,i}^2 + \frac{1}{k} \left[ \sum_{i=1}^{D} u_{1,i}^2 \sum_{i=1}^{D} u_{2,i}^2 + \left( \sum_{i=1}^{D} u_{1,i} u_{2,i} \right)^2 - 2 \sum_{i=1}^{D} u_{1,i}^2 u_{2,i}^2 \right]. \qquad (16)$$

The variance (16) says we do need $s = 1$, otherwise the additional term $(s - 1) \sum_{i=1}^{D} u_{1,i}^2 u_{2,i}^2$ will not vanish even
as the sample size $k \rightarrow \infty$. In other words, the choice of random distribution in VW is essentially the only option if
we want to remove the bias by pre-multiplying the data vectors (element-wise) with a vector of random variables. Of
course, once we let $s = 1$, the variance (16) becomes identical to the variance of random projections (13).
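The hashing step (14) with $s = 1$, and the uncorrected CM estimate it improves on, can be sketched in Python as follows. This is only an illustration and not the Vowpal Wabbit code base: the function names are hypothetical, a seeded random generator stands in for the hash functions, and buckets are indexed from 0 to $k - 1$ rather than 1 to $k$.

```python
import random


def make_hash(D, k, seed=0):
    """Draw a bucket h(i) in {0, ..., k-1} and a sign r_i in {-1, +1} per index."""
    rng = random.Random(seed)
    buckets = [rng.randrange(k) for _ in range(D)]
    signs = [rng.choice((-1.0, 1.0)) for _ in range(D)]
    return buckets, signs


def hash_vector(u, k, buckets, signs):
    """g_j = sum_i u_i r_i 1{h(i) = j}, as in (14)."""
    g = [0.0] * k
    for i, x in enumerate(u):
        if x != 0.0:
            g[buckets[i]] += signs[i] * x
    return g


def inner(x, y):
    return sum(a * b for a, b in zip(x, y))


# Toy binary vectors (hypothetical), D = 100, hashed into k = 1024 buckets.
D, k = 100, 1024
u1 = [1.0 if i % 2 == 0 else 0.0 for i in range(D)]
u2 = [1.0 if i % 3 == 0 else 0.0 for i in range(D)]
buckets, signs = make_hash(D, k)
ones = [1.0] * D                       # no sign correction: the CM sketch

a_true = inner(u1, u2)
a_vw = inner(hash_vector(u1, k, buckets, signs),
             hash_vector(u2, k, buckets, signs))
a_cm = inner(hash_vector(u1, k, buckets, ones),
             hash_vector(u2, k, buckets, ones))
```

For nonnegative data, every bucket collision can only add to `a_cm`, which is one way to see why the CM estimate is biased upward, whereas the random signs in `a_vw` make the cross terms cancel in expectation.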



5.3 Comparing b-bit Minwise Hashing with RP and VW in Terms of Variances

Each sample of b-bit minwise hashing requires exactly $b$ bits of storage. For VW, if we consider that the number of bins
$k$ is smaller than the number of nonzeros in each data vector, then the resultant hashed data vectors are dense and we
probably need 32 bits or 16 bits per hashed data entry (sample). [22] demonstrated that if each sample of VW needs
32 bits, then VW needs 10 ~ 100 (or even 10000) times more space than b-bit minwise hashing in order to achieve the
same variance. Of course, when $k$ is much larger than the number of nonzeros in each data vector, then the resultant
hashed data vector will be sparse and the storage would be similar to the original data size.

One reason why VW is not accurate is that the variance (16) (for $s = 1$) is dominated by the product of two
marginal squared $l_2$ norms $\sum_{i=1}^{D} u_{1,i}^2 \sum_{i=1}^{D} u_{2,i}^2$ even when the inner product is zero.



5.4 Experiments 

We experiment with VW using $k = 2^5, 2^6, 2^7, 2^8, 2^9, 2^{10}, 2^{11}, 2^{12}, 2^{13}$, and $2^{14}$. Note that $2^{14} = 16384$. It is difficult
to train LIBLINEAR with $k = 2^{15}$ because the training size of the hashed data by VW is close to 48 GB when $k = 2^{15}$.

Figure 5 and Figure 6 plot the test accuracies for SVM and logistic regression, respectively. In each figure, every
panel has the same set of solid curves for VW but a different set of dashed curves for different b-bit minwise hashing.
Since the range of $k$ is very large, here we choose to present the test accuracies against $k$. Representative $C$ values (0.01, 0.1,
1, 10) are selected for the presentations.

From Figures 5 and 6 we can see clearly that b-bit minwise hashing is substantially more accurate than VW at the
same storage. In other words, in order to achieve the same accuracy, VW will require substantially more storage than
b-bit minwise hashing.

Figure 7 presents the training times for comparing VW with 8-bit minwise hashing. In this case, we can see that
even at the same $k$, 8-bit hashing may have some computational advantages compared to VW. Of course, as it is clear
that VW will require a much larger $k$ in order to achieve the same accuracies as 8-bit minwise hashing, we know that
the advantage of b-bit minwise hashing in terms of training time reduction is also enormous.

Figure 5: SVM test accuracy on rcv1 for comparing VW (solid) with b-bit minwise hashing (dashed). Each panel
plots the same results for VW and results for b-bit minwise hashing for a different $b$. We select $C = 0.01, 0.1, 1, 10$.

Figure 7: Training time for SVM (left) and logistic regression (right) on rcv1 for comparing VW with 8-bit minwise
hashing.

Note that, as suggested in [22], the training time of b-bit minwise hashing can be further reduced by applying an
additional VW step on top of the data generated by b-bit minwise hashing. This is because VW is an excellent tool for
achieving compact indexing when the data dimension is (extremely) much larger than the average number of nonzeros.
We conduct the experiments on rcv1 with $b = 16$ and notice that this strategy indeed can reduce the training time of
16-bit minwise hashing by a factor of 2 or 3.

6 Preprocessing Cost 

Minwise hashing has been widely used in the (search) industry, and b-bit minwise hashing requires only very minimal (if
any) modifications. Thus, we expect b-bit minwise hashing will be adopted in practice. It is also well-understood in
practice that we can use (good) hashing functions to very efficiently simulate permutations.

In many real-world scenarios, the preprocessing step is not critical because it requires only one scan of the data,
which can be conducted off-line (or at the data-collection stage, or at the same time as n-grams are generated), and
it is trivially parallelizable. In fact, because b-bit minwise hashing can substantially reduce the memory consumption,
it may now be affordable to store considerably more examples in memory (after b-bit hashing) than before, to
avoid (or minimize) disk I/O. Once the hashed data have been generated, they can be used and re-used for many
tasks such as supervised learning, clustering, duplicate detection, near-neighbor search, etc. For example, a learning
task may need to re-use the same (hashed) dataset to perform many cross-validations and parameter tuning (e.g., for
experimenting with many $C$ values in logistic regression or SVM).

For training truly large-scale datasets, often the data loading time can be dominating [32]. Table 2 compares
the data loading times with the preprocessing times. For both webspam and rcv1 datasets, when using a GPU, the
preprocessing time for $k = 500$ permutations is only a small fraction of the data loading time. Without a GPU, the
preprocessing time is about 3 or 4 times higher than the data loading time, i.e., they are roughly on the same order
of magnitude. When the training datasets are much larger than 200 GB, we expect that the difference between the data
loading time and the preprocessing time will be much smaller, even without a GPU. We would like to remind the reader that the
preprocessing time is only a one-time cost.

Table 2: The data loading and preprocessing (for k = 500 permutations) times (in seconds).

Dataset            Data Loading    Preprocessing    Preprocessing with GPU
Webspam (24 GB)    9.7 x 10^2      41 x 10^2        0.49 x 10^2
Rcv1 (200 GB)      1.0 x 10^4      3.0 x 10^4       0.14 x 10^4



7 Simulating Permutations Using 2-Universal Hashing 

Conceptually, minwise hashing requires $k$ permutation mappings $\pi_j: \Omega \rightarrow \Omega$, $j = 1$ to $k$, where $\Omega = \{0, 1, \ldots, D-1\}$.
If we are able to store these $k$ permutation mappings, then the operation is straightforward. For practical industrial
applications, however, storing permutations would be infeasible. Instead, permutations are usually simulated by
universal hashing, which only requires storing very few numbers.

The simplest (and possibly the most popular) approach is to use 2-universal hashing. That is, we define a series of
hashing functions $h_j$ to replace $\pi_j$:

$$h_j(t) = \{c_{1,j} + c_{2,j}\, t \mod p\} \mod D, \quad j = 1, 2, \ldots, k, \qquad (17)$$

where $p > D$ is a prime number, $c_{1,j}$ is chosen uniformly from $\{0, 1, \ldots, p-1\}$, and $c_{2,j}$ is chosen uniformly from
$\{1, 2, \ldots, p-1\}$. This way, instead of storing $\pi_j$, we only need to store $2k$ numbers, $c_{1,j}, c_{2,j}$, $j = 1$ to $k$. There are several
small "tricks" for speeding up 2-universal hashing (e.g., avoiding modular arithmetic). An interesting thread might
be http://mybiasedcoin.blogspot.com/2009/12/text-book-algorithms-at-soda-guest-post.html

Given a feature vector (e.g., a document parsed as a list of 1-grams, 2-grams, and 3-grams), for any nonzero location
$t$ in the original feature vector, its new location becomes $h_j(t)$; and we walk through all nonzero locations to find
the minimum of the new locations, which will be the $j$th hashed value for that feature vector. Since the generated
parameters, $c_{1,j}$ and $c_{2,j}$, are fixed (and stored), this procedure becomes deterministic.
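The procedure just described can be sketched in Python as follows. This is a minimal illustration of (17) and not our preprocessing code: the function names are hypothetical, the prime $p = 2^{61} - 1$ is simply one convenient choice satisfying $p > D$, and the optional argument `b` (keeping only the lowest $b$ bits) is added here for convenience.

```python
import random

P = (1 << 61) - 1   # a Mersenne prime; any prime p > D would do (assumption)


def make_hash_params(k, p=P, seed=0):
    """Draw the 2k stored numbers: c1 in {0..p-1} and c2 in {1..p-1} per hash."""
    rng = random.Random(seed)
    return [(rng.randrange(0, p), rng.randrange(1, p)) for _ in range(k)]


def minhash_2u(nonzeros, params, D, p=P, b=None):
    """Return the k hashed values min_t h_j(t) over the nonzero locations t."""
    out = []
    for c1, c2 in params:
        z = min(((c1 + c2 * t) % p) % D for t in nonzeros)   # Eq. (17)
        if b is not None:
            z &= (1 << b) - 1        # keep only the lowest b bits
        out.append(z)
    return out


# Hypothetical nonzero locations of one feature vector, with D = 2^30.
doc = {3, 17, 256, 99991}
params = make_hash_params(k=8)
sig = minhash_2u(doc, params, D=2 ** 30)
sig_b = minhash_2u(doc, params, D=2 ** 30, b=4)
```

Because `params` is generated once and stored, hashing the same feature vector again yields the same `sig`, which is what makes the preprocessing reproducible across data points and runs.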

Our experiments on webspam show that even with this simplest hashing method, we can still achieve good
performance compared to using perfect random permutations. We cannot realistically store $k$ permutations for the
rcv1 dataset because its $D \approx 10^9$. Thus, we only verify the practice of using 2-universal hashing on the webspam
dataset, as demonstrated in Figure 8.





Figure 8: Test accuracies on webspam for comparing permutations (dashed) with 2-universal hashing (solid) (av- 
eraged over 50 runs), for both linear SVM (left) and logistic regression (right). We can see that the solid curves 
essentially overlap the dashed curves, verifying that even the simplest 2-universal hashing method can be very effec- 
tive. 



8 Conclusion 

It has been a lot of fun to develop b-bit minwise hashing and apply it to machine learning for training very large-scale
datasets. We hope engineers will find our method applicable to their work. We also hope this work can draw interest
from research groups in statistics, theoretical CS, machine learning, or search technology.